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Abstract 

This paper proposes a single form for statistical models that accommodates 
a broad range of models, from ordinary least squares to agent-based microsim¬ 
ulations. The definition makes it almost trivial to define morphisms to trans¬ 
form and combine existing models to produce new models. It offers a unified 
means of expressing and implementing methods that are typically given disparate 
treatment in the literature, including transformations via differentiable functions, 
Bayesian updating, multi-level and other types of composed models, Markov chain 
Monte Carlo, and several other common procedures. It especially offers benefit to 
simulation-type models, because of the value in being able to build complex mod¬ 
els from simple parts, easily calculate robustness measures for simulation statistics 
and, where appropriate, test hypotheses. Running examples will be given using 
Apophenia , an open-source software library based on the model form and transfor¬ 
mations described here. 


1 Introduction 

This paper presents a formal definition of the informally-understood concept of 
statistical models and discusses the benefits of that definition. 

The problem is not that there is no such formal definition, but that there are too 
many, as each subfield of statistics develops its own. The definition here hopes to 
strike a balance between those definitions that are structured to fit only one genre of 
modeling, and definitions that are so unstructured that they provide little advantage 
to users. 

This paper describes a model as a collection of elements including a data space, 
a parameter space, and functions mapping over those spaces: estimation, like¬ 
lihood, random number generation (RNG), and cumulative distribution function 
(CDF). 

The formalization provides sufficient structure to allow the definition of con¬ 
sistency rules among the components of a model and well-defined transformations 
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of models. Formally, the set of models combined with the transformations form 
an algebraic system. Basic models and transformation routines are the nouns and 
verbs of a modeling language allowing authors to express and analyze complex 
models. 

Informally, researchers often describe a situation to be described via a narra¬ 
tive, maybe something like: first, a Normally-distributed random event occurs, 
then it interacts in a specific way with another random event with a Beta distribu¬ 
tion to produce an aggregate, then the outcome is a linear transformation of the 
aggregate. But it can be difficult to formalize the narrative into something that can 
be estimated and compared to other narratives or a data set. The algebraic system 
solves the problem, by providing a set of mappings that each take in a well-formed 
model (or models), and outputs a well-formed model. When each segment of the 
narrative is a model, it is easy to build the full storyline by applying successive 
transformations. 

In contrast, a simple transformation of the structures defined in many textbooks 
and built in to common modeling platforms may produce something outside of the 
space of models defined by the platform or textbook. Such inegalitarian treatment 
of models causes a “computational cliff”, where a slight variant of a built-in model 
(such as replacing a Normal distribution with a truncated Normal) may require an 
entire rewrite of procedures written around the built-in model. The computational 
cliff encourages researchers to use slightly inappropriate stock models rather than 
modify them for the situation at hand. 

More broadly, authors may provide many tools for their specific favored genre 
of modeling and relegate others to a can-be-implemented status. For packages 
where agent-based modeling (ABM) is a first-class paradigm, including Repast 
[North et ah, 2006], Sugarscape [Epstein and Axtell, 1996], Netlogo [Wilensky, 

1999], and several Java libraries, implementing an ordinary least squares (OLS) 
regression is technically possible but practically infeasible. Conversely, OLS is a 
first-class modeling paradigm in any of a number of packages such as R. but it 
is technically possible but practically infeasible to implement nontrivial ABMs in 
R. 1 

By writing tools around a standard, generic form for models, it becomes al¬ 
most trivial to use tools across modeling traditions, such as applying traditional 
statistical tools like maximum likelihood methods, variance estimation, and appro¬ 
priate hypothesis tests to agent-based or machine learning models. This paper will 
include several examples that cross modeling traditions: 

• Example 4 shows how Ordinary Least Squares can be fit into the model 
structure here. 

• Example 5 presents a simulation of the production of a network of nodes, 
and describes the model in the same model structure. 

• Example 8 compares the link distributions generated by the model of Exam¬ 
ple 5 to an Exponential distribution, and finds finds the best-fitting parameter 
for the Exponential distribution. 

• Example 10 describes a mixture of populations from distinct distributions, 

1 The R community provides support for this claim: the CRAN (Comprehensive R Archive Network, 
modeled on TgX’s CTAN and Perl's CPAN) has several thousand packages, but a search for agent modeling 
packages turned up only packages like Rnetlogo, which use R as a front-end to systems where ABMs are 
first-class models. 
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and sets Bayesian priors on the parameters describing the various popula¬ 
tions. 

• Example 13 starts with a consumer preference model, wherein agents search 
for the optimal bundle of goods given a utility function and a budget con¬ 
straint. The model is transformed via a Bivariate Normal kernel smoother, 
and the optimal parameters of the transformed model are calculated given a 
fixed data set. 

• Example 14 is a model composed via an agent-based spatial search model 
as the data-generation process and a Weibull likelihood function for evalua¬ 
tion. By putting priors on the search model parameters, we find a posterior 
distribution over the Weibull parameters. 

The definition of a model and the set of transformations built around it provide 
a set of forms to be filled out for each model. For standard closed-form models, the 
blanks in the forms can be filled in with closed-form expressions. For other mod¬ 
els, each blank can be filled in using a well-defined computational algorithm. The 
process is therefore egalitarian in the sense that all parts of every form are com¬ 
plete for all models, but it remains inegalitarian in that the closed-form expression 
has arbitrary precision and computational algorithms are always an approxima¬ 
tion. For some model-algorithm combinations, a great deal of computing time is 
required to achieve a reasonable approximation; it is up to the model user to recog¬ 
nize these situations and accommodate them with more hardware and time, model 
simplifications, or by contributing a more efficient algorithm to the literature. 

Are there theoretical constraints to treating an arbitrary objective function, as 
given by a simulation or other mechanism, as a statistical model? For probability 
distributions defined as such, the integral of the likelihood over the data space 
given fixed parameters is defined to equal one (herein the unitary axiom). For an 
arbitrary objective function, the integral over all data may take any value. However, 
the unitary axiom is unnecessary for most of our purposes, where a simple ratio of 
likelihoods or other relative comparison is all that is needed. Because calculating 
the total density for a distribution so that it can be normalized to conform to the 
unitary axiom is an expensive operation (it is the motivation for a great deal of 
Bayesian computational machinery), we do so only when necessary. 

Apophenia [Klemens, 2008] is an open-source library of routines that im¬ 
plements several dozen models in a form similar to the one described here, and 
provides several functions and other transformations that make use of such mod¬ 
els. Apophenia demonstrates that the concepts described in this paper can be 
put to practice: it was used to fit the running examples in this paper, and, as a 
back-end to functions written for an R front-end, is used by the U.S. Census Bu¬ 
reau for some disclosure avoidance operations on the 2010 Census and Ameri¬ 
can Community Survey. An appendix provides further discussion of Apophenia 
and a set of full examples; the code used to generate the examples is available at 
http://github.com/b-k/modeling_examples. 

However, this paper has been written to be as platform-independent as possible, 
and readers who are tied to an existing computing platform should find much in 
this paper that could easily be implemented using the tools they have available. 
Readers who expect that the model structure and transformations described herein 
are easy to reimplement via their preferred platform are encouraged to do so. 

Section 2 considers the history of prior attempts to produce systems that cover 
a broad range of statistical models, and even the problem of defining the word 
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model. 

Section 3 presents the model definition used by this paper, as a collection of 
functions that either take in or output parameters or data. 

Given a well-defined set of models, it is possible to define morphisms mapping 
from the space of models back to the space of models. Section 4 presents several 
examples. 

Section 5 considers some immediate applications of a standard model form, 
such as means of hypothesis testing for an arbitrary model. 

The appendix provides several complete examples in code. They are not pseu¬ 
docode, but working examples, demonstrating how models and model transforma¬ 
tions would be implemented in practice, some egalitarian uses of slightly nonstan¬ 
dard models, a complex model built by composing simple models via morphisms, 
and two agent-based microsimulations. 


2 Prior works 

In a 2007 talk about the future of statistics, Bradley Efron described the present day 
as “an era of ragtag heuristics, propelled with energy but with no guiding direc¬ 
tion.” [Efron, 2007] Indeed, a modern researcher has many different interpretations 
of the word model at his or her disposal, each backed by a research community pro¬ 
pelled with energy in its own direction: generalized linear models (GLMs), simple 
distributions like the Normal or Dirichlet, hierarchical nests of models, Markov 
chains, Bayesian frameworks, discrete event simulation, agent-based simulation, 
and so on. But, in Efron's words, we are not yet seeing these ragtag heuristics 
“coalescing into a central vehicle for scientific discovery.” 

2.1 Computing literature 

Demand for a unified model framework seems to have stemmed mostly from de¬ 
signers of new computing systems as they confronted the problem of building 
model objects. 

Some designs take existing formal systems and add a probabilistic element, 
culminating in stochastic programming languages such as Church [Goodman et al.], 
BLOG [Milch et ah, 2005], and Venture [Mansinghka et ah, 2014], Models focus 
on tracking states of the world, from which Bayesian nets (akin to many structural 
equation modeling tools) can be formed. It would be easy and natural to imple¬ 
ment the algebraic system of models described in this paper using these stochastic 
programming languages. 

The HLearn package for the Haskell programming language [Izbicki, 2013] 
casts a variety of statistical models in terms of underlying algebraic structures for 
the purposes of parallelizing their implementation. It restricts itself to model trans¬ 
formations expressible as monoids. Lin [2013] also goes into detail about the value 
of a strict algebraic structure, using simple statistics as examples. 

BUGS (Bayesian Inference Using Gibbs Sampling, Lunn et al. [2012]) has 
been reimplemented several times in several variants [JAGS (Just Another Gibbs 
Sampler), OpenBugs [Lunn et ah, 2009]], each of which has a common flavor. As 
per the names, the focus is on chaining together models via Gibbs sampling. These 
packages therefore offer strong model-composition features, but make little effort 
to provide an egalitarian model object. 
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Object-oriented systems like C++ or Java require a class or object declaration, 
which defines the form to which all objects in the class must conform. Zelig [Imai 
et al., 2008] follows this lead, defining a class format into which it places a large 
set of contributed models. The class structure simplifies the workflow for models 
that explain a single outcome variable using a set of input variables—basically, the 
traditional generalized linear model (GLM) framework and related models. 

Lisp-Stat [Tierney, 1990] follows the mold of object-oriented systems based 
on an inheritance structure, where specific models inherit from a relatively abstract 
model. Specific types of GLM, for example, would inherit from a GLM prototype 
(which Tierney [1991] implements). A model may respond to any from a long 
list of messages, including simple requests like : coef-estimates to get the es¬ 
timated coefficients, up to more specialized and computation-intensive commands 
like : cooks-distance. 

The lead author of Lisp-stat explicitly rejects class formats as too restrictive 
[Tierney, 1990, p 206], Instead, a model is a Lisp-style list of data or procedures, 
and authors may add as many new items to the list as are useful, without class- 
defined constraints. Thus, the system's key strength is that it is flexible, and new 
functions may easily be inserted into any model structure. 

For our purposes, the system’s key failing is that it is flexible, and new func¬ 
tions may easily be inserted into any model structure. Each class of model will 
have its own set of functions that make sense for its genre of modeling, so there 
is no guarantee that the internals of any two models will match sufficiently well 
that one could be swapped for another in a given procedure. If we wish to apply a 
transformation to a model, we will need to modify every interface function in the 
model, which we can do only if we have a comprehensive and consistent list of 
such functions. 

The problem, then, is to define a class structure that is broad enough to usefully 
describe any model, but sufficiently limited that we can be guaranteed that every 
model implements every element included in the definition. 

2.2 Theory literature 

Away from the computer, there is largely consensus on the definition of a model. 

Begin with a data space, D, and a parameter space, P. In practice, these spaces 
are typically a space of IV-dimensional real numbers, R^; a sequence of discrete 
categories, Ci x • • • x Cjv (e.g., sex x age bracket x race x...); a P-dimensional 
sequence of natural numbers N p ; or a Cartesian product of these spaces. The 
numeric values typically have units attached. The parameters may be a list of not- 
predetermined length; if each element of the list is in R then the full list is in the 
space consisting of the set of sets {0,R,R 1 ,R 2 ,...}; a model over a parameter 
space whose elements do not have predetermined dimension is referred to using 
the counterintuitive but customary term non-parametric model. In some cases, one 
or both of D and P may be the empty set. Elements of a given data space B m will 
be written as d m ; elements of a given parameter space P m will be written as p m . 

The D space will need to be totally ordered (that is, a < operation is defined 
so that di < do or do < di, Vdi, &2 £ D), for the purposes of the cumulative 
density function (CDF), which is primarily for hypothesis testing. Even for cases 
like unordered categories, the CDF can still have real-world meaning. For ex¬ 
ample, given unordered categories A, B, C , D and imposed alphabetical ordering, 
CDF(C) — CDF(B) is the density on only category C. 
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McCullagh [2002] gives a long list of authors who either explicitly or implicitly 
use a definition of parameterized statistical model akin to the following, given a 
data space D, a space of parameter sets P, and the nonnegative real numbers R J: 

Definition 1 A parameterized statistical model is a parameter set P together with 
a function P —> (L : D —y Rj), which assigns to each parameter point p £ P a 
probability distribution L p on D. 

See also Hill [1990, p 119], who gives another definition following this mold. 

A Normal model in this context is a function Af : (p, a) —y (L : R —y Rj), 
where p is read as the mean, and a the standard deviation. One could also express 
the mapping via a single function in three variables, L(d, p, a), where d is a scalar 
data point. Conditioning on a single point in the parameter space, such as (p = 0, 
a = 1), fixes a single likelihood function. L(d, p, a\p = 0,cr = 1), which is a 
function of only d. 2 

Definition 1 casts a model as a simple collection of spaces and mappings be¬ 
tween those spaces, consisting of 


We have a function that takes inputs from both D and P, but one could add other 
functions to the collection including the sets D, P, and Rj, including functions 
that take inputs only from D and output to P, such as a routine to estimate the most 
likely model parameters; or functions that take in only an element of P and output 
to D, such as an expected value calculation or a random number generator. 

This paper argues that there is real benefit to including some of these other 
functions in the collection defining a model. 


3 Definitions and examples 

Here is the definition of a model that will be used for the remainder of the paper: 

Definition 2 A model is a collection consisting of the sets D, P, N, and R J and 
the following mappings between sets: 

• Likelihood: L : (D,P) —>• Rq\ 

2 Given a function L(-, •) taking in data d and parameter vector p and reporting the likelihood of the 
pair, one could fix p and thus generate what is called a probability function, L(-, p). Or, one could fix d 
and produce what is called a likelihood function, L(d, •). RA Fisher explains that the two should remain 
separate in interpretation: 

... [I]n 1922, I proposed the term ‘likelihood,’ in view of the fact that, with respect to [the 
parameter], it is not a probability, and does not obey the laws of probability, while at the same 
time it bears to the problem of rational choice among the possible values of [the parameter] a 
relation similar to that which probability bears to the problem of predicting events in games 
of chance.... Whereas, however, in relation to psychological judgment, likelihood has some 
resemblance to probability, the two concepts are wholly distinct....” 

—Fisher [1934, p 287] 

However, setting a rule that L(-, p) should be interpreted differently from L(d, •) does not help us deter¬ 
mine what to call L(*, •) itself. Further, the computer is entirely indifferent to the distinction between the 
ostensibly objective and subjective views of L(-, •). In this paper, I use the notation L instead of P to reduce 
ambiguity with the parameter space, for all variants of L(-, •). 
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• Estimation: Est : D —> P. 

• (Pseudo-)random number generator: RNG : (P,N) —> D. 

• Cumulative distribution function: CDF : (D, P) —r [0,1]. 

The L function might also be described as an objective function, which has 
fewer associations than the term likelihood. 

A (pseudo-)random number generator (RNG or PRNG) is a determinstic func¬ 
tion of input parameters and an input seed, typically a large natural number. The 
RNG function then deterministically maps each (p £ P, n £ N) combination 
to a relatively unpredictable element of D. With no second argument, RNG(p) 
indicates a representative draw d £ D. 

Some definitions below will use f SeB L(S, p )dS, which requires that D have a 
metric such that the integral over the entire space is well-defined. If D is discrete, 
it will be understood that f SeB is the sum Definition 5 will require that 

f p L{ d, p)dp (or the corresponding sum) be well-defined. 

A data set, written as D, is a collection of N elements from a single data 
space, where N is any integer > 1. If the N elements are independent, then for 
any likelihood function p(-), we define p(D) = p(di)p(d 2 ) • • -p(djv), so the 
sequel will primarily discuss likelihood functions over a single data element. 

This model definition is a mix of mathematics and the practical reality of how 
models are used, and later sections will demonstrate why a ‘larger’ definition of 
a model has practical benefit. For example, the CDF is included because hypoth¬ 
esis tests are typically an assertion regarding some portion of the CDF. Even so, 
a real-world implementation as a computational structure or object would likely 
digress from this bundle of functions. One might prefer to include a log likelihood 
function, rather than the plain likelihood function listed above, or might include an 
entropy function, even though entropy can be calculated using the elements already 
given. 

We are interested only in models that have basic internal consistency. 
Definition 3 A model is ML-consistent iff: 

• f B L(5 , p)d(5 is greater than zero and finite for any fixed p £ P iff D ^ 0. 

• For given datum d, argmax p L(d,p) = EST(d). That is, the estimation 
function of a model produces a maximum likelihood estimate (MLE) of the 
model’s likelihood function. 

• The odds of drawing a data point from the RNG is proportional to its like¬ 
lihood. That is, for any fixed p £ P and any di, cR £ D, (probability 
of drawing di = RNG(p)) • L(da,p) = (probability of drawing d .2 = 
RNG(p)) • T(di,p). 

• For any fixed (p £ P, d £ D), the probability that RNG(p) < d equals 

CDF(d, p). 

Let M indicate the space of ML-consistent collections. This is the set of models 
that this paper considers. Elements of M will be distinguished by a subscript, such 
as the Normal distribution model, Msf, and where needed the constituent functions 
will be given matching subscripts: Laa, EsTat, RNGat, and CDFa/. 

There is typically an additional requirement (one of the Kolmogorov axioms) 
that f D L(5, p)dS = 1 for any fixed p; note that such a constraint is not imposed 
here. However, because the CDF function conforms to the RNG function, which 
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is proportional to the L function, CDF(d. p) approaches one as d gets larger, for 
any p. Dropping the unitary axiom simplifies most of the analysis, and Section 4.1 
will consider options for those cases where we need to know f B L(S, p)dS. 

The consistency conditions for the CDF and RNG are largely definitions of 
these terms. The condition that EST(d) is an MLE of the likelihood function 
given the data is nontrivial. Maximum likelihood has desirable properties, like 
how it is an unbiased estimator as the data size —> oo under weak and common 
assumptions, and has undesirable properties, like how it is often a biased estimator 
for small samples [Pawitan, 2001], 

One can easily find useful models in the literature where the estimation rou¬ 
tine does not produce an MLE, such as estimations based on minimizing mean- 
squared error (MSE). One could define MSE-consistency as similar to Definition 
3 but with the MLE condition replaced with estimation via minimized MSE. This 
would define a new space M mse , with a new set of transformations mapping from 

M mse ^ M mse . 

Other potentially interesting rules could include Kullback-Leibler divergence 
minimizing consistency, entropy-maximizing consistency, or a method of moments- 
style rule matching model moments to data moments. Implementing a system of 
modeling based on KL-consistency, entropy-consistency, or any other option is left 
as an exercise for the reader. 

Example 1: The Normal distribution Closed-form or computationally- 
efficient methods have been derived for every function in the model definition: 

• B = R. 

• P = lx R + , representing the mean /i and standard deviation a. 

• Likelihood: A 2 ex P( )• 

• Estimation: jj, = mean of D; a = — /() 2 / n - 

• RNG: No closed form expression, but see Devroye [1986]. 

• CDF: No closed-form expression, but see Press et al. [1988]. 

Let Mj\r indicate this model. 

Example 2: The Probability Mass Function (PMF) A data set can be 

treated as a probability distribution, where the likelihood of a given observation 
is proportional to the frequency with which the observation appears in the data. 
The PMF model will appear frequently throughout this paper, when a closed-form 
model can not be calculated or derived. 

Let D/ be the initial data set used to estimate the PMF, and its individual 
elements d/i, d/ 2 , • • •, d/jv. 

• D: Typically or a short list of categories. 

• P = R^ x D^, representing weights (wi ,..., Wn) € R^ assigned to N 
observations from D. 

• Estimation: p = ((u>i,d/i), (w 2 ,d J2 ),..., (w N ,d IN )). 

• L(d, p): If d 0 D/, zero. Else, the weight assigned to d. 

• RNG: weighted draw from p. 

• CDF(d, p): sort d/s according to <; sum weights up to d. 



Because of the format of P, a model based on four observations in R will 
be different from one based on five observations in R, so this template defines a 
multitude of elements of M. This paper will use Mpmf to refer to any of these 
models. 

In some cases, kernel smoothing or moving average transformations can im¬ 
prove how well a PMF represents the underlying process. 


Example 3: Coin flip A coin-flip model where heads and tails are equiprob- 
able could be expressed via a PMF estimated using the two-observation data set 
D i = {heads, tails}. Then p = Est(D/) = {(|, heads), (|, tails)}. 


Example 4: Ordinary Least Squares OLS is typically described via an 
equation Y = f3 X + e, relating one subset of the input data, the ‘independent vari¬ 
ables’ X, to an output variable, the ‘dependent’ Y; and e is a Normally-distributed 
error term. There are two considerations required to fit OLS into M. 

First, the process of modeling for an OLS enthusiast consists of selecting a 
set of independent variables. As with Mpmf, slightly different data spaces pro¬ 
duce different models all sharing a common template: an OLS model where D 
=[outcome, age, sex, log(income)] is distinct from one where D =[outcome, age, 
log(income)]. The model is a joint distribution across the full data space including 
both independent and dependent variables. 

Second, the likelihood of a given point (Y, X) under the OLS model is calcu¬ 
lated using the likelihood from the Normal model, Lat ((Y — /3X), a) [Greene, 
1990, p 144], The integral of this likelihood over all possible values of (Y, X) is 
infinite, so the development here adds additional structure to reduce the likelihood 
to a finite integral, using a PMF model calculated using the X portion of the input 
data set. There are infinitely many other developments possible. 


• D: as specified by the model author, with dimension N. 

• R = R n X Rj, indicating ((3, a), including a coefficient on a constant 
column 1 inserted as the first column of X. 


• Estimation: /3 = (X'X)" 1 (X'Y) 
.Likelihood: { ^ ((Y - /JX), a) 


xe D x 
X£D x 


• RNG: Set ppmf = EsTpmf(D.y); draw X = RNGpmf(ppmf); draw e from 
RNGat(0, a); set Y = /3X + e. 


• CDF: Write the distribution of (Y — /3X) as the sum of independent Nor¬ 
mals, Vo ~ M{Y,a/N) plus Vi ~ U{—piX\,a/N) ...plus V N ~ 
JV(—/3nX n , a/N)', calculate total CDF accordingly. 


3.1 Filling in missing elements 

Consider two authors who wish to fit their models into the model form. The first 
is working with an Exponential distribution, which has well-known methods for 
calculating parameter estimates, the CDF, and so on. When implementing the 
model, this author will want to make use of these efficient routines. 

The second author has developed a new function, say a simulation or an ec¬ 
centric relationship between dependent and independent variables, has embodied 
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it in a likelihood function, and now wishes to ask questions that require estimating 
the most likely parameters give a data set, or making random draws from the PDF 
implied by the likelihood function. That is, the modeler wishes to use a full model 
but only has time to write down the likelihood. This author will need a model 
implementation that fills itself in using only one or two author-defined elements. 

Every item from this point to the end of the paper will have a default routine 
that can be evaluated for any M £ M and some special-case routines for models 
where efficient shortcuts or closed-form solutions exist. Because the purpose of 
this paper is to show that any type of model can be given egalitarian treatment, the 
focus will be on the default algorithm, but in no case are we obligated to use the 
default when there is a closed-form solution available. 

The 'larger" collection of Definition 2 more readily handles sets of default and 
model-specific cases than the ‘smaller" Definition 1. There are hooks for model- 
specific estimation. RNG, and CDF routines, and defaults can be provided in sev¬ 
eral directions: 

L —> EST Maximum likelihood estimation routines are black-box routines that 
use the likelihood function regardless of its form [Press et ah, 1988], These rou¬ 
tines propose a parameter pi and use the likelihood function to calculate L( d, pi), 
then jump to further samples P 2 •. . Pn based on the values L(d, pa),..., L(d, p n ) 
Thus, one can use these optimizers to implement an estimation routine given only 
a likelihood function. Search routines are designed for the optimization of general 
functions, so probability-like properties that the function being optimized integrate 
to one—or even that it evaluate to a nonnegative value—are not required. 

L -4 RNG In one dimension, adaptive rejection Markov sampling [Gilks et al., 
1995] uses likelihoods as a black box to make draws from the likelihood at the cor¬ 
rect rate. ARMS also considers only the ratio of likelihoods, and therefore has no 
requirement that its input function integrate to one. In multiple dimensions, Sec¬ 
tion 4.8.2 uses a Metropolis-Hasting MCMC algorithm to draw from an arbitrary 
data space. 

RNG -A L Make a few thousand random draws from the RNG; write each 
down as a single observation in a new data set (this is the stochastic variant of 
memoization of a function). Use the draws to estimate a PMF model approximat¬ 
ing the original model. The likelihood represented by the PMF is a rough discrete 
approximation of the true likelihood, but it is consistent: as the number of draws 
approaches infinity, the error goes to zero. If D is continuous, a smoothing transfor¬ 
mation (cubic splines, kernel density, moving average) may improve the accuracy 
of the PMF. 

The resultant PMF model has L and CDF methods, which may be used as 
approximations to for the main model's L and CDF. 

RNG -» CDF Make random draws and count the percentage of draws less 
than the given point d. 

CDF -A F Calculate L via numeric deltas from CDF. 
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Figure 1: A diagram of the listed means of filling in one element of a model with 
another. 

CDF -> RNG Given O = R, draw r =a random value from a Uniform(0,1) 
distribution. Via binary search, secant method, or Newton-Raphson method, locate 
d such that CDF(d, p) = r. Then d is an appropriately-weighed random draw 
from the model with CDF [Devroye, 1986, pp 32-33]. 

For most of these directions (L —> EST, RNG —> L, RNG —> CDF), the 
ML-consistency requirements are met almost trivially; for the L —>■ RNG and 
CDF —> RNG algorithms, see the given citations for discussion of how the algo¬ 
rithms generate an RNG that is ML-consistent with the input likelihood and CDF. 

Fiture 1 shows the directions presented here, clarifying that one can begin with 
any of L, RNG, or CDF and arrive at any of the four function elements of the 
model collection. The Est element is lossy, in the sense that one Est function 
ignores a great deal about the L function away from the optimum, and so may be 
equivalent to a variety of L functions. 

Example 5: A network-generation simulation This network generation 
simulation is based on agents with a position in an underlying preference space 
(in this case, R). Each agent’s preferences are first generated as a draw from a 
A/”(0,1) distribution (though a will be allowed to vary in a variant below). Given 
two agents at positions p; and pj, they link with probability 1/(1 + |p; — Pj\). 

The output is a list of how many links agents 1 ,,N each have. It is useful to 
report sorted output, so the first dimension is the highest link count, down to the 
final dimension reporting the lowest link count. See Figure 2 for a summary of the 
algorithm. 

With N — 10, one run of the simulation produces output in N 10 . Figure 3 plots 
the first two dimensions of the result of of 10,000 runs. For visualization purposes, 
a small jitter was added to each point. 

The algorithm defines a model on P = 0 and D = N 10 . We have only defined 
the RNG portion of the model, but the above transformations indicate that the 
likelihood and other elements of the model can be well-defined by using the RNG. 

Consider the hypothesis that the number of links to the most popular person is 
less than or equal to four. The number of links for the second, third, ... persons 
are unrestricted, which we could express as the vacuous restriction that these link 
counts are less than or equal to nine. One would evaluate this hypothesis using the 
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1. Fixer = 1. 


2. For each agent i: 

(a) Draw a position pi using RNGjv(0, a). 

3. For each pair of agents, i and j: 

(a) Draw a random value r ~ W[0,1]. 

(b) Link iff r < . . 1 - T . 

4. Output data sorted count of connections for each person. 

Figure 2: A simulation to produce a random distribution of link densities. 


percentage of the total distribution that is below the point [4,9,9,..., 9], which is 

CDF a im ([4, 9,9,..., 9], 0) « 0.0533. 

Examples 13 and 14 in the appendix present some uses of more elaborate sim¬ 
ulation models. 


4 Operations on models 

This section presents a series of morphisms with signatures / : M -> M or 
/ : (M, M) — > M, that apply a well-defined transformation to every element 
of the input model(s) to produce a new model, including transformations via dif¬ 
ferentiable functions, fixing a parameter, or forming the product of independent 
distributions. 

The transformations map to M, so successive transformations can be chained 
together. As the examples below will demonstrate, the transformation paradigm 
allows models of arbitrary complexity to be built using simple steps. 

4.1 Conditional results 

But before writing down these transformations, a few things can be said about 
when two related models have equivalent elements, which will be useful in check¬ 
ing that our transformations are ML-consistent and related the way their descrip¬ 
tions claim they are. 

The likelihood of p These statements will rely on an additional set of defini¬ 
tions: 

Definition 4 

L(p)= [ L(S, p)dS 

and 

L(d|p) = L(d,p)/L(p). 

The unitary axiom requires that L(p) = 1, Vp; dropping that assumption mo¬ 
tivates the need for this definition. Recall that the definition of ML-consistency 
requires that oo > L( p) > 0 for all p £ P. 
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Figure 3: A distribution of the number of links to the two highest-ranked members of a 
ten-person network. Plot produced by running the simulation in Figure 2, then adding 
a jitter to each point. 


Definition 5 

L(d)= [ L(d, p)dp 

J VpGP 

and if oo > L(d) > 0, 

L(p|d) = L(d,p)/L(d). 

This definition requires integrability on P. 

Equivalent elements These definitions allow us to state when two models 
share Est, RNG, or CDF elements. 

Lemma 1 If for each d £ B there exists a constant it'd such that 
L 2 (d, p) = A' d Li (d, p), Vp 6 P, 
then EST 2 (d) = ESTi(d). 

This is sufficient but not necessary. 

Proof: For an arbitrary d £ O, p max = ESTi(d) and any other p x , 

Lt(d, Pmax) > Li(d, p :c ), so 

it'd • Ll(d, Pmax) > K d ■ Ll(d, p x ), so 
-^2 (d, Pmax') P L 2 {d,p x ). 

Thus, p max is also the argmax for L 2 (d, ■), so ESTi(d) = EST 2 (d). The premise 
of the lemma states that this holds for all d£fi. 0 
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Proposition 2 If for each deD there exists a constant A'd such that 
L 1 (p|d) = A' d L 2 (p|d),Vp€P, 
then ESTi(d) = EST 2 (d). 

Proof: Because L\ (d)/L 2 (d) is constant for any given d, (d'p> a con ~ 
stant iff is a constant; then Lemma 1 applies. 0 

Lemma 3 For each p £ P there exists a constant K p such that 
Li(d, p) = ApL 2 (d, p), Vd £ D 
iff RNGi(p) = RNG 2 (p) and CDFi(p) = CDF 2 (p). 

Proof: Given that L i (d. p) = A p L 2 (d. p). By the definition of RNG (p), 
the ratio of the probability of drawing di to the probability of drawing d 2 equals 
^i(d 2 P) ’ an ^ t ^ 1 ' s ra h° does not change when we multiply the numerator and 
denominator by A' p . The CDF element of the model is defined in terms of the 
RNG element. 

In the other direction, RNGi(p) = RNG 2 (p) for some p implies that, for any 
arbitrary di and d 2 . 


Li(di,p) = L 2 (di,p) 

Li (d 2 , p) L 2 (d 2 , p) ’ 

SO 

Li(di.p) = Li (d 2 , p) = 

L 2 (di,p) L 2 (d 2 , p) P ‘ 

Proposition 4 For each p£ P there exists a constant K p such that 
Li(d|p) = ApL 2 (d|p), Vd € D 
iff RNGi(p) = RNG 2 (p) and CDFi(p) = CDF 2 (p). 

Proof: Li(p)/L 2 (p) is constant for a given p. Then is constant iff 

l- 2 (d p) * s constan L then Lemma 3 applies. 0 

The two propositions have a nice symmetry: proportional shifts in L(pjd) 
do not affect Est; proportional shifts in L(d|p) (which may differ from L(d, p) 
because we dropped the unitary axiom) do not affect RNG or CDF. 

4.2 Parameter fixing 

The simplest means of reducing an JV-parameter model to an N — 1-parameter 
model is to fix a parameter at a constant value, such as turning a two-parameter 
Normal distribution, A f(fc,cr), into a one-parameter distribution, A /"(/it, 1). One 
may also write this as A/"(/r|cr = 1). 

Figure 4 shows the mapping from input model and constraint to constrained 
output model. The mapping M — > M breaks down into a mapping of the separate 
components of a model: parameter space to parameter space, likelihood function 
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Fix: (M, p/ e P) -» M 

► D' = D. 

► P' = P'. 

► L'(d,p') = L(d, pj x p'). 

► Est': use default of MLE. 

► RNG'(p') = RNG(p/ x p'). 

► CDF'(p') = CDF( P/ x p'). 


Figure 4: Definition of the Fix mapping. 


to likelihood function, and so on. The elements of the post-transform model are 

given a prime, such as L'(-, •), Est'(-),_ 

Some of these transformations require additional information to be fully de¬ 
scribed. which will be specified using subscripts. For example, a Normal model 
with a fixed at one will be written as FlX 0 - = i(Afjv')- These additional details are 
“private” to the generated model, used for the inner workings of the associated 
functions, but not used outside of the confines of the model. 

Let the parameter subset be fixed at py £ P, and the complement of its space 
in P be P', so P x P' = P. Elements in P' are notated as p'. Then Figure 4 defines 
the Fix transformation. 

The RNG and CDF equivalences are by Proposition 4. Because this general 
case offers no relation between L(p|d) and L'fp'ld, p/), we are unable to reuse 
Est, although there are many well-known special cases. 

4.3 Cross product 

In an example below, we will want a prior for the ^ and a of a Normal distribution. 
We could set the prior for fx as a Normal model and the prior for a as a square-root- 
transformed x“ model (see differentiable transformations, below). These could be 
used sequentially, first setting a and then setting /r, or this transformation could be 
used to bind together the two models to produce a model for (fx, a). 

The simplest case of binding together two models is where both have distinct 
parameter spaces and data spaces. There is therefore effectively no interaction 
between the models, and the independence assumption dictates that L^d'. p') = 
Li (di, pi) • L 2 (d 2 , P 2 ). Figure 5 specifies the rest of the CROSS transformation. 
The CROSS morphism is associative, e.g., 

Cross(Mi, Cross(M 2 , M 3 )) = Cross(Cross(Mi,M 2 ),M 3 ). 

Therefore, there is no ambiguity in writing CROSS (Mi, M 2 , M 3 ) to represent the 
combination of three models. 
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Cross: M 2 -»M 

► O' = Di x 0 2 . 

► P' = Pi x P 2 

► L'(d',p') = Li(di,pi) • L 2 (d 2 ,p 2 ). 

► Est' = (EsT(di),EsT(d 2 )) 

► RNG' = (RNGi(pi), RNG 2 (p 2 )). 

► CDF' = CDF 1 (p 1 )CDF 2 (p 2 ) 


Figure 5: Definition of the CROSS mapping. 


Mix: (M 2 , w e [0,1]) M 


► 0 / — Ox = 0 2 

► P' = Pi x P 2 x [0,1] 

► L m ( d,pjvr) = wLi(d, px) + (1 - w)L 2 ( d, p 2 ). 

► Est': Typically solved via EM algorithm 
[Dempster et al., 1977].“ 

► RNG': First, draw a random number r ~ U[ 0,1], 


RNG'(pm) 


RNGi(pi) r < w 
RNG 2 (p 2 ) r > w 


► CDF'( Pu) ) = 

«;CDF 1 (d, px) + (1 - tu)CDF 2 (d, p 2 ). 


"The EM algorithm is a loop consisting of the expected value func¬ 
tion, Fix, and the Est element of the parameter-fixed model. 


Figure 6: Definition of the Mix mapping. 
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DataTrunC: (M, Con d : D -A {0,1}) —A M 


► O' = the subset of D where Con^ is one. 


► P' = P. 


► L'(d,p) 


Md.p) 

In' L(S,p)dS' 


► Est': use MLE. 


► RNG': Draw d using RNG. If Con d (d) = 0, 
throw d out and try again. 

► CDF': Random draws from RNG' can be used to 
calculate the CDF. 


Figure 7: Definition of the DataTrunC mapping. 


4.4 Model mixing 

Consider two models, both over the same data space D but distinct parameter 
spaces Pi and P 2 . and a weighting w £ [0,1], and let Pm be the Cartesian product 
Pi x P2 x [0,1]; in the other direction, one could decompose a point p m into its 
three parts, pi, P 2 , and w. Figure 6 displays the Mix transformation mapping 
from two models and a weight to a model representing the mixture. 

As per examples to follow, the mixture transformation is especially useful 
in combination with the Fix transformation, for the purposes of fixing just the 
weights (fixing w = 0.5 is a popular choice), or fixing pi and P 2 and leaving the 
estimation only to find the most likely w. 

As with CROSS, one could recursively use the two-input Mix transformation 
to mix three or more models. 


4.5 Data Constraint 

Let the constraint function Cond : D —> {0,1} be an indicator function that is one 
when the input data is within some desired boundary and zero when it is not, and 
let the constrained data space O' be the set of data points where Cond(d) = 1. 

Figure 7 describes a transformation, DataTrunC, for the case where data out¬ 
side the constraint is suppressed. See Cameron and Trivedi [1998, §4.5] for a 
discussion of how maximum likelihood estimation of LdataTrunc recovers the pa¬ 
rameters of the unconstrained model, and Listing 17 for an example implementing 
and demonstrating the transformation. 

Proposition 4 shows that within B', RNG' and CDF' must be identical to RNG 
and CDF; and it is easy to verify that RNG' is ML-consistent with L'. 

Because the scaling for L depends upon p, for an arbitrary unconstrained 
model M, L DataTrU nc(m) (p|d)/LM(p|d) is not constant over all p. so Propo¬ 
sition 2 does not apply and the Est routines will differ between pre- and post¬ 
transformation models. 
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Mixcdf: M 2 -a M 

► O' = Oi = 0 2 

► P' = Pi x P 2 

► L'(d,p M ) = 

CDF^(0)-L 1 (d.p 1 ) + (l-CDF Ar (0))L 2 (d,p 2 ). 

► Est': use default of MLE. 

► RNG', CDF': as per 

FlX w=C DF Af (0)(Ml (by Proposition 

4). 


Figure 8: Definition of the MlXCDF mapping. 


Example 6: A censored distribution with a point mass One could write 
the case of a truncated Normal distribution where any d < 0 is discarded using 
Aftrunc = DataTrunc^oCMa/-). 

But consider a M at with all observations less than zero reported as exactly 
zero. One could use the models and transformation to this point to begin to express 
this model, which is the combination of a PMF based on one data point and a 
truncated Normal. 


Mf = FlX( w= i id=0 ) (Mpmf) 

M t = DATATRUNCdAolAfAt). 

It is not quite correct to say that the final model is a mixture of Mf and Mt, 
because the mixing weight for Mix(Mf, Mt) needs to be CDF(0, Mat), which 
is itself a function of /x and a. We do not yet have a transformation that sets a 
parameter at the value of a CDF, but it is trivial to write an ad hoc transformation 
for the situation; see Figure 8 for a description of the MlXCDF transformation. 

Then the final model of a A/”(/x, cr) with values less than zero replaced to zero 
is ATiMcut = MlXCDF(A// T , M f ). 


4.6 Differentiable transformation 

Consider a transformation function / : R^ —> M . N , with inverse / -1 : R^ —> 
M . n , and inverse derivative matrix (the Jacobian) J(/^ 1 ). The likelihood in the 
mapped-to space is the likelihood in the original space weighted by | det( J(/ _1 ))| 
See. for example, Greene [1990], The transformation of likelihoods can naturally 
be extended to a transformation of full models, as per Figure 9. 

The set of invertible functions over a given space D under the function com¬ 
position operator forms a group. Therefore, for a given base model Mf,, the set of 
models JACOBIAN / {Mb) over the same set of functions also form a group under 
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Jacobian: (M, / : R N -a R w ) -a M 

► D' = D = R n . 

► P' = P. 

► L'(d,p) = L(/ _1 (d), p)| det(J(/“ 1 )(d))|. 

► Est'= EsT(/ _1 (d)) 

► RNG' = /(RNG(p)) 

► CDF': Use default. 


Figure 9: Definition of the JACOBIAN mapping. 


Swap: M M 

► D' = P 

► r = D 

► L'(d',p') = L(p',d'). 

► Est', RNG', CDF': use defaults. 


Figure 10: Definition of the Swap mapping. 


the model composition function defined by JACOB IAN/(M(,)ojACOBIAN 9 (jV/{,) = 

Jacobian/ 09 (A4). 

Further, by the Radon-Nikodym theorem, for any two absolutely continuous 
models with respect to a given measure, there exists a transformation between 
them. Thus, some applications of other transformations in this paper are special 
cases of the JACOBIAN transformation. 

Example 7: Volume If the lengths of a set of cubes has a truncated-at-zero 
Normal distribution, DATATRUNCd>o(A/v), th en th e volumes are described by 

M cube = jACOBIAN /w=a , 3 (DATATRUNC d >o(Mv))- 

4.7 Swapping parameters and data 

Because data and parameters are not symmetric in the definition of a model, swap¬ 
ping them produces a new model, as per the SWAP transformation defined in Figure 
10. Notably, the estimation process searches for the most likely parameters given 
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DCOMPOSE: (M 2 , Nseq G N^) -A M 

► W = 0 . 

► P' = Pto X Pfrom. 

► L'(0, p) = 

Lfrom(RNGto(dtoi Pto , N Seq) , Pfrom)* 

► Est': use default of MLE. 

► RNG'(p) = 0. 

► CDF' = 0. 


Figure 11: Definition of the DCOMPOSE mapping. 


fixed data, so the post-swap model's estimation searches the orignal model’s data 
space given a fixed value in the original model's parameter space. 

This will be used in Section 4.8.2 to implement random draws from a data 
space via Markov Chain Monte Carlo and Section 5.1 for finding the most likely 
prediction from a model. 

4.8 Composition 

The transformations in this section consist of chaining together the output from 
one model (Mf rom ) as input to the next (M to ). Given that each model is associated 
with two spaces, there are four possible means of chaining output from the first 
model to input to the second. 

• If Bf rom is generated from the RNG of the first model, linking Df rom = 
Oto, allows us to evaluate the draws of one model using the likelihood and 
estimation routines of another model. 

• In a hierarchical model, the parameter estimate from the child model(s) is 
used as a data set for the parent model, in the form Pf r0 m = B to . 

• Matching Bf rom = Pto produces what is commonly referred to as Bayesian 
updating. 

• The pairing Pf rom = Pto is of limited utility. 

4.8.1 Data composition (Bf rom = B to ) 

The direction of Bf rom = B to is most commonly implemented by using random 
draws from the first model as the input data set for the second. Let Nseq G be 
an arbitrary sequence of natural numbers (in practice, draws from a PRNG). Then 
RNGto(dto) G B to , and because Bf rom = Bto, the draw can be used as an input 
to Lf rom . Figure 11 details the rest of this composition. 

Attaching a fixed sequence of draws to the model retains the model structure: 
the likelihood is deterministic and does not require additional PRNG inputs, and 
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similarly for the estimation routine. Each sequence Nseq therefore defines a new 
model. In practice, the class of models would likely be implemented on the fly, 
with random draws made as needed. In such a case, L and the MLE that depends 
on L are stochastic. Some ML search methods are better suited to stochastic search 
(e.g., simulated annealing, simplex algorithms) than others (e.g., conjugate gradi¬ 
ent searches). 

Example 8: Exponentially-distributed random graphs Distributions of 
network link counts in human networks are known to take a few known distribu¬ 
tions, including the Exponential, Waring, Yule, or Zipf distributions. 

So it is sensible to evaluate how well the simulated networks produced by the 
network generation model above, Mstm, fit to an Exponential distribution, Me xp - 

The network generation model produces data appropriate for composing with 
the Exponential distribution. The composed model, 

ATNcomp — DCOMPOSEjVseg(Ikfsi m , 

has B = 0 and P = R, and can be used to evaluate the likelihood of the network 
link counts generated by Msim using the likelihood function of an Exponential 
distribution. 

In the specification of Example 5, a was fixed at one, and so Fsim = 0; 
consider the variant M a sim where a is not fixed, so P CT s; m = <r, and Msim = 
FlX (T= i(M 0 -si m ). What is the simulation that comes closest to an Exponential 
model with A = 1? We could answer this question by writing down the model 
Mf ix \ = FlXx=i MExp ), and using its estimate routine to find the opti¬ 

mal parameter <7 op t = EST/j^x- Evaluating EsT/^a gives <r « 0.51. 

4.8.2 Parameter composition (D from = P to ) 

This form of composition is generally referred to by its immediate application: 
Bayesian updating. 

For example, let Mu ke i ihood = Fix ct= i(Mat) (so Pnkeiihood = R, represent¬ 
ing p), and let a Normal distribution with known p. and a be the prior model (so 
Bprior = R). Then the data space of the prior matches the parameter space of 
the likelihood. There are several ways to implement the elements of this model, 
discussed below. 

More generally, let A/ prior have p £ P pr i or and p £ B pr i or , and Mukeiihood 
have d £ Bi ike iihood and p £ Pnkeiihood- The calculation of L poste rior(p|p, d) 
can be broken down via Bayes’s rule to be proportional to: 

Lprior (pIp) " Lnkelihood(d|p.p), (1) 

We take p £ P pr ior as a fixed-with-certainty prior belief regarding the param¬ 
eters of the prior model, d as being observed with certainty, and use the fact that 
L need not integrate to one to specify the likelihood of the posterior model. Figure 
12 details this data-parameter composition transformation in full. 

The default/model-specific set for estimating this transformation has three parts. 
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DPCOMPOSE: (M 2 , p G IP prior) M 

► D' = Dlike 

► P' = D pri or 

^ L — ^prior(Pi P)Llike(djp) 

► Est': table of conjugates, rejection sampling, or 
MLE. 

► RNG', CDF': posterior draw, 
Metropolis-Hastings or draw from prior. 


Figure 12: Definition of the DPCOMPOSE mapping. 


Conjugate priors Some pairs of models are conjugate distributions , which 
have known, closed-form posteriors. Tables of conjugate pairs are readily avail¬ 
able. 3 In this case, the Est, RNG, L, and CDF elements of the prior-likelihood 
composition can be replaced with the corresponding elements of the posterior 
model. 

Likelihood prior Metropolis-Hastings Markov chain Monte Carlo (MHM- 
CMC) can be used to draw from a distribution where the likelihood function can 
be easily calculated to within some constant proportion, as in the case of Equation 
1. We could thus use it to make random draws from a prior-likelihood pair. See 
Metropolis et al. [1953] for details of the algorithm, which will be notated here as a 
function Metro : (M, M) — > P. Given this function, one can implement Bayesian 
updating using the tools described to this point. 

• Generate a model whose likelihood is the product of the input likelihoods 
via M c = DPCOMPOSE ( M pr ior, M HkeHhood ). 

• MHMCMC makes draws from the parameter space given fixed data, but we 
want to make draws from data space given fixed parameters, so generate 
Mac =SWAP {Me). 

• MHMCMC requires a proposal distribution. A Normal distribution is a pop¬ 
ular choice, so let M propo3e = Mj^. 

• Metro draws from the proposal and accepts or rejects the draw using a target 
distribution, so we have enough to generate an RNG for the model, drawing 
from the data space of M propose and testing it against M ac . 

Figure 13 is a diagram depicting the setup, a model where 

RNGp OS t = Metro{MM,SwAP{DPcoMPOSE{M pri or,Mii ke ))). 


3 E.g., 

distributions. 


http://en.wikipedia.org/wiki/Conjugate_prior#Table_of_conjugate_ 
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Posterior model 

RNG: Metrofproposal, target) 

proposal: 



target: 


Swap 


Z/(ci, p) — L jjp corn p 0se (p, d) 

DPcompose 

L(d,p) = , 

Lpri{ P,p') • Li ike ( d|p) 


Prior j 

Lprior : D, P — > Rq" 

Likelihood 

Luke : D, P — > R+ 


Figure 13: A posterior distribution augmented with an RNG element based on 
Metropolis-Hastings MCMC. Each box is a model; many transform the models em¬ 
bedded within their box. 

RNG prior Consider a prior that has no explicit L element, but does have an 
RNG element. We will build a PMF model for the posterior which, as above, has 
a parameter set consisting of a list of pairs (wi, p;). For a fixed prior parameter p, 
make a draw of p, from the prior; by definition, this has likelihood proportional to 
Lprior(Pi? p)• Let Mf m (d,p) = FlXp(Afiike) and set the weight for this datum 
as Wi = Lf m (d|pi). Then, as the count of elements —> oo, a draw from the PMF 
(wi, pi) is proportional to L' = Lf m (d|pi)dL pr i 0 r(pi, p). 

MCMC methods are popular in part because the proposal distribution 'walks 
around’ in the target distribution, spending more time in the regions with greater 
weight, so a mismatched proposal (like a A/”(0,1) proposal for a A/”(5,1) target) 
will adapt to become a well-matched proposal. Conversely the fixed prior model 
does not move, and always makes more draws from the region(s) of the prior model 
with the most weight. This can be inefficient if there is a large mismatch between 
the prior and the likelihood. As the number of draws goes to infinity, drawing from 
the prior directly and drawing from a proposal distribution would be equivalent, but 
for a small number of draws, the MCMC chain generally gives smoother results. 
Flowever, the direct-draw method is feasible for RNG-based models for which 
MCMC is not, and is not sequential. Typical computers in the present day can run 
several threads in parallel (thousands at once in an increasing number of cases), 
and there may exist conditions where a larger volume of parallel stationary draws 
could generate a more accurate posterior than the adaptive sequential method. 

Example 9: Bayesian updating with a simulation Example 8 built M N com P 
to describe a network simulation where the likelihood of a set of randomly gener¬ 
ated networks was described via an Exponential distribution. 
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Figure 14: A PMF representing the posterior distribution of A in the 

DCOMPOSE7v se g (AFjVet5 ^e.rp ) model. 


We may have prior beliefs that A ~ FlX( |[i=mjCT=s )(Mjv') for some known 
parameters (/r = m, a = s). Then we can generate a new model as 

Mpostl = DPCOMPOSE(FlX (M=mj(T = s) (Afv), MNcomp). 

Figure 14 shows the PMF reduction of this model. 

Some authors draw a dichotomy between models that are on the table of conju¬ 
gate distributions, and can therefore be treated parametrically, and all other models 
whose posterior must be approximated by a nonparametric distribution. However, 
one can leave the model expressed using the prior’s P and the likelihood's D. Here, 

M pos t2 = DPcompose(Mm, M Nc0 mp), 

has P = R x Rj, representing /i and a. It is a parametric model like any 
other, and its parameters can be estimated using EST pos t2(0), and their confidence 
bounds can be calculated using (stochastically appropriate) methods from Section 
5.2. 

4.8.3 Hierarchical modeling (P c hiid = ©parent) 

Consider a school, consisting of several classrooms. Each classroom provides 
a distinct data set d\,. d,N , from which we can estimate the parameters of 
a model, such as a set of regression coefficients or characteristics of a network 
model. Stacking a model for each classroom produces M c lasses = CROSS(M c i ass i, 
..., M classN ), where EST c i asses (di x • • • xdjv) is a stack of parameters (EST(di), 
..., EST(djv)) that can then be used as a data set for a school-wide model. 

It can be shown [Box and Tiao, 1992] that in the case of linear models for both 
parent and child, the model can be reduced to a single linear model. This is a 
special case of the general form, where the parent and child models could take any 
arbitrary form, so long as the parameter space of the child models match the data 
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PDcompose: M 2 -> M 

► D' = D chi i d 

► P' = Pp ar ent 

► L (d, p) = Lparent (EST c hild(d), p) 

► EST (d) = ESTp aren t(EST c } 1 ild( < i)) 

► RNG': draw p^ = RNG pare nt(p). then return 

RNG c hiid(E_)- 

► CDF': use defaults 


Figure 15: Definition of the PDCOMPOSE mapping. 


space of the parent model. Figure 15 describes this parameter-data composition 
transformation in detail. 

Example 10: A narrative model The format of transforming simple models 
to produce more complex models suggests and facilitates a style of narrative mod¬ 
eling, in which a real-world problem is broken down into basic components and 
the manner by which the components are hypothesized to combine is explicitly 
described via corresponding model transformations. 

In this example, consider the arrival times of dinner party guests. One type 
of guest tries to be on time, but may face delays, which occur at the rate of A 
per minute. The second type of guest tries to be fashionably late, ideally arriving 
30 minutes late, but hitting that mark imprecisely. Nobody is early: those who 
miscalculate and arrive in the neighborhood before the stated start time take care 
to delay enough to arrive exactly on time. 

The likelihood function for an Exponential distribution describes the percent of 
a group not yet exited from a population given departures at a rate A as described 
above. The derivative of one minus L exp therefore describes the arrival rate at any 
given point in time, and the reader can verify that this is 


M a = JacobiaNi /a (Mex P ). 

The second group is reasonably described by a Normal distribution, with values 
less than zero fixed to exactly zero. This is the Afisicut model developed above. 
Assuming the two groups are evenly split, the overall group is 

ATg r p = Fix^,— o.5(Mix(JacobiaNx/a(ATex P ), ATNcut)). 

This model has D = Rq , representing minutes late; and P = Rj x R x R + , 
for (A, fj,, a). 

We should put priors on these parameters. For A, DATATRUNC a! >o(MA/') is 
a reasonable form for a prior; for fi, we could use Mat; for cr, we could use the 
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square root of a x 2 model, JACOBIAN ^(M x 2 ) with parameters E. Then the prior 
is 


Mpri = CROSS(DATATRUNC x >o(MAr),MAr, JACOBIAN N / 5 (MinvWish)) 

The overall model is then M = DPcOMPOSE(M pr i, AT grp ). It has parameters 
P representing (/ii, <ji, /12, 02, E), and D = RjJ" representing minutes late. 

There are a several things that one could do with M. One would be to fix 
the input parameters to reasonable prior beliefs, use rejection sampling to reduce 
M to a nonparametric PMF, and compare the model to data using the data-space 
tests described in Section 5 . Or, one could estimate the five input parameters via 
Est(M) and run parameter-space tests (also discussed below) to give confidence 
bounds on the prior parameters. 


5 Applications 

This section lists a few existing methods that operate on generic black-box models. 
Any of the models described to this point could be used for any of these applica¬ 
tions. This section assumes the reader is already familiar with the methods, and 
will only briefly discuss the considerations needed to construct algorithms around 
elements of M, and their use for the broad range of models that this paper contem¬ 
plates. 

As with many of the elements to this point, the goal is to establish an imple¬ 
mentation for all models; closed-form implementations for certain models can be 
filled in as needed. 

Before discussing a few more notable functions in detail, here are a few calcu¬ 
lations that can work with input of a generic model (or models) in a straightforward 
manner: 

• Entropy. 

• Kullback-Leibler divergence. 

• Cook’s distance and other leave-TV-out cross-validation techniques. 

• Derivatives and second derivatives of differentiable parameters. 

5.1 Prediction 

The prediction problem is effectively a problem of partially missing data. For ex¬ 
ample, given the independent variables of an OLS model, X, we may wish to pre¬ 
dict the most likely values of the missing dependent variable Y. We can generate 
a default algorithm for this problem using the tools above. 

Given a model M with known parameters p, estimated using some reference 
or training data, p = EST(d ro f), then Swap(AT) generates a model where the 
original p is treated as fixed data. Est S wap ( M ) (p) produces the most likely data for 
M. Now consider a data set D, divided into known and unknown parts, Dknown 
and D m iss. Then we can fix some parameters of Swap(AT) at Dknown, leaving 
a model whose only unknown parameters are those in D m i ss - Then the model is 
M P red = FlXD kn0 wn (Swap(M)), and the maximum likelihood estimate of the 
data elements to be predicted is EST pre d(p). 
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5.2 Testing 

Given that models in M are an intermediary between a data space and a parameter 
space, it makes sense to divide tests into those regarding the data space and those 
regarding the parameter space. 

5.2.1 Parameter-space tests 

A test about a claim regarding the parameters has three steps: 

1. Specify a statistic, such as a parameter estimate. 

2. State the distribution of the statistic estimate as specified by the model. 

3. Find the odds that the statistic lies within some given range of interest using 
the CDF of the statistic as specified by the model. 

For example, after an OLS regression, the assumptions of the model indicate 
that the parameter estimates are Multivariate Normal, and that fact is commonly 
used to state the likelihood with which each parameter differs from zero. 

We have a model’s parameter estimates, from Est(-), but also need the distri¬ 
bution of the estimates. Setting aside those models (such as OLS) where it can be 
derived via closed-form calculations, there are a few default methods available. 

Covariance via bootstrap and jackknife These methods take in a data set 
and an estimation routine [Efron and Gong, 1983, Efron and Tibshirani, 1993], 

For iteration i of the algorithm, both methods generate an artificial data set (the 
bootstrap, via sampling with replacement; the jackknife via a leave-n-out process), 
then run the estimation routine to calculate p;. Once several hundred or thousand 
such parameter estimates are derived, the covariance matrix is calculated for the 
given parameters. 

The final estimate of the statistic is the mean of the statistic calculated for each 
data subset, so central limit theorems typically apply. These techniques therefore 
apply to a wide range of models. Flow ever, they assume that draws from the data 
are analogous to draws from the true population. 

Simple replication For the network model in Examples 5, 8, and 9, where 
D = 0, the parameter distribution can be produced by repeated draws from the 
RNG. This is not bootstrapping, and does not rely on the core assumption of boot¬ 
strapping: in a sense, the draws from the RNG could be thought of as the true 
population asserted by the model. 

Covariance via inverse Hessian The Fisher Information matrix is the nega¬ 
tion of the inverse of the expected Hessian matrix (the matrix of second derivatives 
of the log likelihood function). 4 Especially for likelihood functions that are well- 
approximated by a Taylor expansion to the second (squared) degree, Fisher Infor¬ 
mation can be a good estimate of the parameter variance, and thus a good input 
to parametric hypothesis tests. Evaluating the quality of a second-degree Taylor 

4 Efron and Hinkley [1978] discuss the question of whether to use the expected value of the Hessian or its 
value at the ML estimate only. They find that in typical cases, the value at the ML estimate is to be preferred. 
For some algorithms, the Hessian is a computational side-effect of the maximum likelihood search, so its 
computation is cheap. 
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approximation in the neighborhood of an MLE can also be done using only the 
model and a data set. 

5.2.2 Data-space tests 

For models where P is either trivial enough that it is 0 or so complex that the 
dimensionality of a given element p £ P is unknown (i.e., for nonparametric 
models), it is difficult to use parameter-space tests. In this case, we can develop 
metrics to describe the distance between models via computations over the data 
space. 

One class of tests involve finding the distance from the fitted model to a target 
model constructed by producing a PMF from a set of observed data, distinct from 
the data used to estimate the model being tested. Given the two PMFs, there is 
then the problem of summarizing their differences in a single statistic. 

There are many options, and it it is beyond the scope of this paper to describe 
their relative merits, but this segment discusses the problem of applying any of 
them to an arbitrary element of M. 

Recall that the parameters of a PMF are a set of weights associated with points 
in the data space: ((rui, di), (ui 2 , CI 2 ),..., (wjv, djv)). Given a second model 
that has different weights at the same data points—i.e., a model with parameters 
((u>i, di), ( w' 2 , d 2 ),..., (w' N , djv))—one could calculate the distance between 
the vector (wi , ,wn) and (wi,..., w' N ) by a number of means: 

• fc-fold cross-validation tools typically report normalized Euclidian distance, 
aka root mean squared error, between the predicted and observed PMFs. 

• Kullback-Feibler divergence can be used to measure the information loss 
from the actual to the predicted observations. 

• The Kolmogorov-Smirnov test reports a statistic based on the maximum dis¬ 
tance between the CDFs of the two distributions. 

Given a PMF and an arbitrary distribution also defined over the same D, one 
could construct a PMF via ‘binning’. The simplest method of constructing a PMF 
from an arbitrary model Ma with parameters p is by setting wi = La(c1i, p), 
..., wn = FA(djv, p). Given a metric on D, one could also define the set of 
points closest to d , for each j. and use CDF .4 to calculate the total weight assigned 
to each set. Regardless of the binning method chosen, the problem of comparing 
a discrete PMF to a continuous PMF is now reduced to the problem of comparing 
two matched PMFs. 

For two arbitrary models over the same D, we can more easily reduce the prob¬ 
lem to one of comparing matched PMFs, by drawing points from the data using 
either or both RNGi and RNG 2 , or sampling a grid of data points covering D (pro¬ 
vided the space is finite). Either method again reduces the problem to a comparison 
of two matched PMFs. 


6 Conclusion 

This paper presented a definition of a model as a collection including a data space, 
a parameter space, Rq", and functions expressing the many ways in which model 
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users go between data and parameters. Although this may seem like a mere nota- 
tional convenience, it provides sufficient structure to allow for the construction of 
an extensive, internally consistent modeling language. 

The examples in this paper demonstrate the egalitarian goals of the algebraic 
system, as consistent functions and transformations are applied to models regard¬ 
less of whether they are Normal distributions, simulations, prior-likelihood com¬ 
binations, or nested combinations of all of these. Given a set of qualitative results 
from a simulation, it is natural to ask to which parameters the results are sensi¬ 
tive, and this paper shows that there are sensible ways to use standard statistical 
techniques for measuring variability given changes in parameters, and therefore 
the confidence with which a parameter estimate can be made. Bayesian updating 
with a microsimulation as a likelihood is not common in the literature, but is easy 
to operationalize in the framework here, to the point of being almost obvious. 

Writing down and applying the many model transformations presented in this 
paper was a near-trivial matter thanks to the formal definition of a model. The al¬ 
gebraic system readily accommodates generic methods that are well-known in the 
literature, including transformations via differentiable functions, mixture model 
estimation routines, and so on. There are no technical limitations preventing the 
implementation of such a system using virtually any modern computing platform. 

Yet, in preparing this paper. I was unable to find a computing system (beside 
the one demonstrated in the appendix) that offers a standard, genre-agnostic model 
form. Instead, computing systems are typically built around structures that are 
most convenient to immediate problems. This paper is a demonstration of how 
much can be built from a consistent model object, and an invitation to authors to 
consider making greater use of model objects such as the one described here as 
they develop new platforms or build new models using existing platforms. 


Appendix 

This appendix presents some technical notes on the implementation of a model 
object, and a number of examples of model creation and manipulation. The ex¬ 
amples are in standard C using the open-source Apophenia library of functions 
for scientific computing, which may create difficulties for those readers who have 
limited proficiency in C or no interest in using Apophenia. 5 The hope is that 
such readers will still be able to follow how models are created, transformed, 
and used, even if the details of syntax are unfamiliar. Readers who would like 
more on the details of syntax are referred to the Apophenia documentation, at 
http: //apophenia. info. 

Although the examples are in C, these routines could be written in any 
Turing-complete programming language. Readers proficient with other plat¬ 
forms are encouraged to consider how these programs could be written using their 
preferred platforms. 

Despite the difficulties, there are benefits to showing complete code examples. 
First, many of these examples produce uninteresting output; it is the process by 
which the output was produced that is of interest. Second, this code compiles and 
runs, demonstrating that the concepts in this paper can be beneficially applied to 
real-world problems. Third, pseudocode and mathematical reductions may have 

5 Standard C means code conforming to the ISO/IEC 9899:2011 standard. 
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hidden errors or omissions; using tested code, we can be much more confident that 
all relevant considerations have been addressed. 

Notes on implementing a model object The algebraic system in this paper 
focuses on a single object class, the model. Before presenting the examples, this 
segment first presents a few notes on the details of how the class of apop_model 
objects is implemented. 

The object is a structure collecting several elements: 

• Data 

• Parameters 

• Several functions, including those in Definition 2, but also a constraint func¬ 
tion for doing constrained optimization and some functions to handle logis¬ 
tics. There are both p and log_likelihood functions, which are largely 
redundant, but allow authors to use whatever is more common in their genre 
of modeling. 

• A hook for groups of settings, such as a group describing how a maximum 
likelihood search should be run, or a collection of the items requisite for 
running a simulation. 

• A flag for marking processing errors. 

Apophenia’s models are a C struct holding these elements. For example, the 
Normal distribution model is encapsulated in the apop_normal structure, which 
largely depends on the GNU Scientific Library (GSL) for the computational work 
[Gough, 2003]. The implementation of the Multivariate Normal is very similar, 
and largely uses the GSL as well. Apophenia implements the univariate and multi¬ 
variate cases separately, though one could in fact implement only the Multivariate 
Normal, because it reduces to a univariate Normal given one-dimensional input. 

The estimate routine has the form Est : (D, M) —» M, which differs from 
the estimation function described above. The parameters of the input model are 
NULL; the output model is a copy of the input model with the parameters set. 

Apophenia implements default routines for estimation, random draw, and other 
methods via dispatch functions. For example, here is a short program to read in 
a data set (in plain text format, whose first column is the dependent variable and 
subsequent columns are numeric independent variables), estimate an OLS model, 
and display the estimated model to the screen. 

#include <apop.h> 
int mainOf 

apop_data *d = apop_text_to_data("dataset.csv"); 
apop_model ^estimated = apop_estimate(d, apop_ols); 
apop_model_print(estimated, NULL); 

} 

The apop_estimate function checks the input struct, apop_ols for a non¬ 
null estimate function. If one is present, then the data is sent to that function. If 
the input struct has a null estimate function, then the data and model are sent to 
the default routine: apop_maximum_likelihood. 6 

6 Because maximum likelihood search is the the default for estimation, Apophenia includes several opti¬ 
mization methods borrowed from the GSL, including a few types of conjugate gradient methods, Newton's 
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Gentleman and Ihaka [2000] produce models like estimated via closures, 
which are functions bound together with an environment holding variables used 
in the function. Similarly, the apop_model is a struct holding both the functions 
in Definition 2 and the now-estimated parameters (and potentially other model- 
specific settings, discussed below). 

As an aside, all new functions in any language and for any purpose need to 
be tested, which can be difficult for numeric algorithms beyond nontrivial data. 
Having a default method and a model-specific method that theoretically achieve the 
same result provides a sensible testing procedure: test whether the default method 
and model-specific methods produce results within acceptable tolerances of each 
other. 

Some models and methods require variables and settings not listed in the core 
model struct. The apop_model struct can therefore hold a list of settings groups. 
Functions are provided for adding, modifying, and removing settings groups to a 
model. For example, a histogram model might have a settings group specifying 
bin locations. 

To give a more involved example, there are effectively two use-cases for an 
MCMC chain. One is to generate a PMF by making a large number of draws from 
the chain and recording them as a batch. The other is to make individual draws 
from the chain, in the style of a typical random number generation function. A 
settings group attached to the model from which draws are made holds the user- 
specified burnin period, the chain up to that point, and its final state. For use as 
a model, the entire run to that point is available; for use as an RNG, the state is 
available for making the next draw from the burnt-in chain. 

Many of the transformations described in the body of this paper output a model 
specially written for the transformation with a settings group pointing to the base 
model. For example, the input to apop_model_fix_params is a model estimated 
using a parameter set with some numeric values and some NaN values, where NaN 
is a not-a-number semaphore as specified in the IEEE 754 standard. Parameters 
with not-NaN values are fixed at the given values; parameters set to NaN are left 
free. The output from the transformation is a model with a settings group pointing 
to the original model. The output model has Est, L, RNG,... functions that do the 
requisite transformation on the inputs, call the base model, and return the (possibly 
transformed) result from the base model’s Est, L, RNG,_ 

The list of methods embedded in the model struct is deliberately restricted. 

Other things one might do with a model, including prediction, specifying submod¬ 
els for estimated parameters, calculating the score, and printing to the screen or a 
file, are implemented via a virtual table (vtable) mapping keys to functions. The 
keys are built by combining relevant elements of the model. E.g., for a Bayesian 
updating routine, the relevant elements are the likelihood functions of the prior 
and likelihood. If those functions appear on a table of conjugates, the vtable finds 

method, the Nelder-Mead simplex algorithm, and simulated annealing. The model object may have an as¬ 
sociated score function (the dlog likelihood), which is used by some of the search methods. Given a model 
m with no closed-form score, it is sometimes better to use a non-derivative method, and sometimes better to 
use apop_numerical_gradient (m) to approximate the score of m via delta method. 

Some large-dimension searches work best via a per-dimension search: fix all parameters at their expected 
value given the input model; do a search for the optimum of the first parameter; fix that parameter at its op¬ 
timum; search for the optimum of the second parameter; repeat until the end of the parameter list; repeat the 
loop until the change in objective function over a search is less than a user-specified tolerance. Implementing 
this simply requires repeated application of the Fix transformation. 
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a routine that produces the appropriate output, and if the likelihoods are not found 
in the vtable, the default of MCMC or drawing from the prior is used (depending 
on which elements are non-null in the prior). 

Example 11: An RNG-estimation round trip This first set of examples 
includes a truncation transformation function for models where D = R, and a set 
of uses. The uses fulfill the promise of egalitarian treatment of models both before 
and after truncation. 

Generally, the code examples here can be read from the bottom up, as the last 
function in the example will call functions earlier in the listing, which may call 
functions earlier still. The last function of Listing 16, truncate_model, takes in 
a model and returns a model truncated at a given cutoff. 

The remainder of the listing defines a model to be returned. The prep function 
does Apophenia-specific work, setting the output model’s parameter sizes, and 
storing the original, untruncated model at the output model's more pointer. The 
RNG for the truncated model makes draws from the model stored at more until 
one of the draws is greater than or equal to the cutoff. 

The example implements the DataTrunc transformation for the special case 
where O = R and the cutoff is of the form /(d) = 1 iff x > k for some cutoff 
k. Note that the truncated model produced by the transformation is lacking an 
estimation routine, so that must be filled in via defaults. 

Listing 17 demonstrates a test of a model, in which a few thousand draws are 
made using the model’s RNG, and then that drawn data set is used as input to the 
model’s estimation routine. If the parameters estimated match the parameters used 
at the beginning of the process, our confidence that the RNG and estimation are 
consistent rises. The notable point about this code is that main calls this round-trip 
function in the same manner with four models: a Normal distribution, a truncated 
Normal, a Beta distribution, and a truncated Beta distribution. 

The output of the program will not be printed here because, as with most of the 
examples in this appendix, the actual output is as expected or trivial. In this case, 
it prints the estimated /.t and a for the Normal and truncated Normal—(1.000660, 
0.999806) and (0.998964, 0.997244), respectively—and a and /? estimates for the 
Beta and truncated Beta—(0.700536, 1.706045) and (0.716461, 1.717567), re¬ 
spectively. 

Example 12: Estimating the parameters of a prior+likelihood model 

In this example. Nature generated data using a mixture of three Poisson distribu¬ 
tions, with A = 2.8, 2.0. and 1.3. Not knowing the true distribution, the analyst 
wrote down a model with a truncated Normal(2, 1) prior describing the parameter 
of a Poisson likelihood model ( M-p ). That model produces a posterior distribution 
over A. The analyst would like to present an approximation to the posterior in a 
simpler form, and so finds the parameters /r and a of the Normal distribution that 
is closest to that posterior, given the data. 

The storyline can be expressed as single function: 

Est(RNG (Mix (Fix 2 . 8 (M p ), Fix 2 . 0 (M-p), FiXi. 3 (ALp))), 
DPCOMPOSE (DATATRUNCd>o(FlX Al= 2 , CT= i(A/A^), M v ) , 

Listing 18 is is the transcription of this function. 
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#include <apop.h> 
double cutoff; 

double under_cutoff(double in){ return (in < cutoff); } 

long double like(apop_data *d, apop_model *m){ 

double any_under_cutoff = apop_map_sum(d, .fn_d=under_cutoff, .part=’a’); 
if (any_under_cutoff) return —INFINITY; 

//apop_cdf wants an apop_data set; we have a bare double 
gsl_vector_view gv = gsl_vector_view_array(&cutoff, 1); 

return apop_log_likelihood(d, m—>more) 

— (d—>matrix ? d—>matrix—>sizel : d—>vector—>size) 

* log(l — apop_cdf(&(apop_data){.vector=&gv. vector}, m—>more)); 

} 


static int rng(double *out, gsl_rng *r, apop model *m){ 
do apop_draw(out, r, m—>more); while (*out < cutoff); 

return 0; 

) 


static void prep(apop_data *d, apop_model *m){ 
apop_model *base_model = m—>more; 
apop_prep(d, base_model); 
m—>vsize - base_model—>vsize; 
m—>msizel = base_model—>msizel; 
m—>msize2 - base_model—>msize2; 
m—>parameters = base_model—>parameters; 


apopjmodel *truncated_model = &(apop_model){"A truncated univariate model", 
like, .draw=mg, .prep=prep}; 


.log_likelihood= 


apop_model *truncate_model(apop_model *in, double cutoff_in){ 
cutoff = cutoff_in; 

apop_model *out = apop_model_copy(truncated_model); 

out—>more - in; 

out—>dsize= in—>dsize; 

out—>constraint= in—constraint; 

return out; 


Listing 16: A truncation transformation. 
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#include <apop.h> 

apop_model *truncated_model; //The model and function defined in its own listing. 
apop_model *truncate_model(apop_model *in, double cutoff_in); 

void round_trip(apop_model *m){ 

//this copying and NULLifying is unnecessary; it’s here so you know I’m not cheating. 
apop_model *clean_copy = apop_model_copy(m); 
clean_copy—>parameters = NULL; 

apop_data *draws = apop_model_draws(m, 2e5); 
apop_model_show(apop_estimate(draws, clean_copy)); 


int main(){ 

printff N(l, l)\n"); 

apop_model *m - apop_model_set_parameters(apop_normal, 1,1); 
round_trip(m); 

printf("\n truncated N(l, l)\n"); 
apop_model *mt = truncate_model(m, 0); 
round_trip(mt); 

printf("\n Beta(.7, 1.7)\n"); 

apop_model *mb = apop_model_set_parameters(apop_beta, .7, 1.7); 
round_trip(mb); 

printf("\n Truncated Beta(.7, 1.7)\n"); 
apop_model *mbt = truncate_model(mb, .2); 
round_trip(mbt); 


Listing 17: A series of round trips. 
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#include <apop.h> 

apop_model *truncated_model; //these are from earlier. 
apop_model *truncate_model(apop_model *in, double cutoff_in); 

int main(){ 

apop_model_print ( 
apop_estimate( 
apop_update( 

apop_model_draws( 

apop_model_mixture( 

apop_model_set_parameters(apop_poisson, 2.8), 
apop_model_set_parameters(apop_poisson, 2.0), 
apop_model_set_parameters(apop_poisson, 1.3) 

), 

le4 

), 

truncate_model( 

apop_model_set_parameters(apop_normal, 2, 1), 

0 

), 

apop_poisson 
)—>data, 
apop_normal 

) 

, NULL 

); 

} 

Listing 18: A Normal approximation to an updated model, in a functional style. 
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This is the ‘functional' style of expression, where the program execution is 
the evaluation of a single complex function. The one difference from textbook 
functional code is that the use of the RNG function makes the program stochastic: 
setting apop_opts. rng_seed to an arbitrary integer would cause a small change 
in the final evaluation. 

Example 13: A demand-side ABM Listing 19 is a simple model of agents 
deciding the quantity of goods to buy given prices, preferences, and a budget. It 
is effectively an RNG, but Listing 20 does the same round-trip as earlier exam¬ 
ples, generating a data set using the model’s RNG, and then estimating the optimal 
parameters of the model given that data set. 

The model: 

• There are 1,000 agents. 

• Each agent i has a budget allocation b, and preference coefficient cm, each 
drawn from independent Normals. Fix a = 1 for both distributions, leaving 
us with two parameters from this part of the model: pi and p a . 

• Two goods are available for purchase; the first has price p and the second 
price one (without loss of generality from a setup with two prices pi and 
P2>. 

• Each agent’s utility front purchasing the good is U(qi, q 2 , a) = qf + q 2 - 
Agents are utility maximizing. 

• Agents maximize subject to the budget constraint that pq\ + qa < &i. 

• We observe the mean consumption Q i and Q 2 (total consumption divided 
by the number of agents). 

For this simple case where there are no interactions among agents, the con¬ 
sumption problem can be easily solved. Given the problem 

max u = q\ + 92 

B b = piqi + 92, 

the optimum, where du/dqi = 0 and the budget constraint is satisfied, is 

qi = ( pi/a) 1/(1 ~ a) 

92 = max(6 — piqi, 0). 

The core of the model, in Listing 19, is a single draw of a population and its 
decisions, in the draw function. The first several lines shunt parameters, then the 
agents are drawn, then the next loop does the optimization for each agent. This 
form is not very streamlined, but is easily extensible; for example, we might want 
to have a step where agents interact between the loop generating the population 
and the loop calculating their consumption decisions. 

Listing 20 wraps the simulation into a model. Looking at the p function, we 
see that a single likelihood for a given data/parameter set is found by making 500 
draws, then smoothing the PMF using a kernel density estimate in which a Mul¬ 
tivariate Normal is placed over every drawn point. The bulk of the code in this 
segment is spent setting up the Multivariate Normal and KDE. 

The main function executes a simple model application similar to the round- 
trips as in Listing 17, generating a small data set, then estimating the optimal 
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#include <apop.h> 

typedef struct { 

double b, alpha, ql, q2; 

} an_agent; 

int draw(double *qs, gsl_rng *r, apop_model *m){ 
double ml = apop_data_get(m—>parameters, 0); 
double m2 = apop_data_get(m—>parameters, 1); 
double pi = apop_data_get(m—>parameters, 2); 

apop_model *ba_model = apop_model_cross( 

apop_model_set_parameters(apop_normal, ml, 1), 
apop_model_set_parameters(apop_normal, m2, 1) 
); 


//set up agents by drawing from the above model 
int agent_count=1000; 
an_agent a_list[agent_count]; 
for(int i=0; i< agent_count; i++){ 
double out[2]; 
do { 

apop_draw(out, r, ba_model); 
a_list[i] = (an_agent){.b=out[0], .alpha=out[l]}; 
} while (a_list[i].alpha <=0); 


//agents decide 
qs[0]=qs[l]=0; 

for (int i=0; i< agent_count; i++){ 
qs[0] += a_list[i].ql = GSL_MIN( 

pow(p 1 / a_list[i]. alpha, 1./(1—a_list[i]. alpha)), 
a_list[i].b/pl); 

qs[l] += a_list[i].q2 = (a_list[i].b — pl*a_list[i].ql); 

} 

qs[0] /= agent_count; 
qs[l] /= agent_count; 
apop_model_free(ba_model); 

return 0; 


Listing 19: The core of a demand-side model. 
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parameters given that data set. Setting the dim_cycle element in the MLE settings 
group tells the optimizer to use the EM-style strategy of dimension-by-dimension 
optimization. 

Example 14: A search model Listing 22 is a spatial search model: 

• An equal number of agents of types A and B are randomly placed on a grid. 
They seek to pair with an agent of opposite type. 

• Agents in the middle of the grid have eight neighboring squares; agents on 
the edge or a corner have fewer because they are unable to go off the grid. 

• Until all agents are paired up: 

- If an agent is adjacent to another agent of the opposite type, they pair 
up, and are taken out of the model. We record how long it took for them 
to find each other. 

- Remaining agents take a single step to a random unoccupied neighbor¬ 
ing square. 

• The model output is the list of pairing times. 

We can expect that initially, when there are many agents on the grid, that some 
agents will find each other quickly. Eventually, there will be only one A agent and 
one B agent, and they will wander for a long time before pairing up. 

The ratio of agent count to grid size matters: 90 agents on a 10 x 10 grid is a 
very different search from 90 agents on a 1, 000 x 1, 000 grid. We will put a prior 
on these below. 

Much of the code is simple and mechanical; Listing 21 presents some macros 
and functions for use by agents searching for a mate or stepping to another cell in 
the grid. This listing is included for completeness, but is especially C-specific and 
can be skipped. The last line of Listing 22 wraps the simulation as a model £ M, 
with P = 0 and D representing the pairing times of each agent. 

• There is a random number generator associated with every agent. RNGs 
tend to not be parallelizable, but if every agent has its own, then one could 
still thread the agents’ activities. 

• The agents live in two structures, set up in run_sim. One is a non-changing 
array of agents. The second is a grid of pointers, with either NULL (vacant) 
or a pointer to one of the agents in the unchanging list of agents. We can 
easily loop through the agents via the array, and check positions and their 
surroundings with the grid. When agents are paired up, mark done=true in 
the agents’ records, leave them in the array, and erase their presence on the 
grid. 

One run produces a list of agent exit times. The rate at which agents exit 
changes with time. This is also the story of a Weibull distribution, so it would be 
interesting to see how well such a distribution fits. A Weibull has two parameters, 
k and A. If k = 1, then A is the mean time to exit. If k < 1, then the time to 
exit is slower for those that are still present later in the game: some (probably the 
more centrally-located) get picked up quickly, and the hard-to-match take several 
As’ time to find a match. So for this simulation, we expect that A grows as the grid 
gets more sparse, and k should be noticeably less than one. 
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#include <apop.h> 

int draw(double *qs, gsl_rng *r, apop_model *m); //from previous listing. 

//for the Kernel density: center the Normal distribution around a datum 

void set_fn(apop_data *d, apop_model *m){ 

gsl_vector_memcpy(m—>parameters—>vector, Apop_rv(d, 0)); 

} 


long double p(apop_data *d, apop_model *m){ 
apop_multivariate_normal—>vsize = 
apop_multivariate_normal—>msize 1= 
apop_multivariate_normal—>msize2= 2; 

apop_model * kernel = apop_model_set_parameters(apop_multivariate_normal, 

0 , 1 , 0 , 

0, 0, 1); 

apop_model * smoothed = apop_model_copy_set(apop_kemel_density, apop_kemel_density, 
.base_data = apop_model_draws(m, 500), .kernel=kemel, .set_fn=set_fn); 
double out = apop_p(d, smoothed); 
apop_data_free( smoothed—>data); 
apop_model_free( smoothed); 
return out; 


apop_model *demandside = &(apop_model) {"Demand given price", .vsize=3, .dsize=2, 

.draw=draw, .p=p}; 


int main(){ 

apop_data * draws = apop_model_draws(.count=20, 

.model = apop_model_set_parameters(demandside, 2.6, 1.6, 1.2)); 
apop_data_show(draws); 

Apop_settings_add_group(demandside, apop_mle, .tolerance=le—5, .dim_cycle_tolerance=le—3); 
apop_model_show(apop_estimate(draws, demandside)); 

} 

Listing 20: Setting up and using the RNG, L, and Est elements of the demand-side 
ABM. 
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typedef struct { 

gsl rng *mg; 
int x, y; 
char type; 
bool done; 

) agent_s; 

#defme gridpt(x, y) grid[(x)*grid_size + (y)] 

#define xoff_yoff_loop(...) \ 

for (int xoff=—1; xoff <=1; xoff++) \ 

for (int yoff=— 1; yoff <=1; yoff++) { \ 

if (a—>x + xoff >= grid_size II a—>x + xoff < 0 II \ 

a—>y + yoff >= grid_size II a—>y + yoff < 0 II (Ixoff & lyoff)) \ 
continue; \ 

_VA_ARGS_\ 


agent_s *search_for_mate(agent_s *a, agent_s **grid, int grid_size){ 
xoff_yoff_loop ( 

agent_s *b = gridpt(a—>x+xoff, a—>y+yoff); 
if (b && b—>type!=a—>type) return b; 

) 

return NULL; 


void v.cpiagcnt s *a, agent_s **grid, int grid_size){ 
int open_ct = 0; 
xoff_yoff_loop ( 

if (!gridpt(a—>x+xoff, a—>y+yoff)) open_ct++; 

) 

if (!open_ct) return ;//landlocked, can’t move 
int move = gsl_mg_uniform(a—>mg) * open_ct; 
xoff_yoff_loop ( 

if (!move-) { 

gridpt(a— >x, a—>y) = NULL; 
a—>x += xoff; 
a—>y += yoff; 
gridpt(a—>x, a—>y) = a; 
return; 


) 

} 

Listing 21: Mechanical functions for use by the search model in Listing 22 as 
search_f ns. c. Included for completeness. 
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#include <apop.h> 

#include <stdbool.h> 

#include "156—search_fns.c" //previous listing 

void generate_agents(agent_s **grid, int grid_size, int pop_size, agent_s *out){ 
for(int i=0; i< pop_size; i++){ 
agent_s *a = out+i; 

*a = (agent_s){.rng = apop_mg_alloc(apop_opts.rng_seed++), 

.type = (i % 2) ? ’ A’ : ’B’}; 
do{ //find a vacant spot 

a—>x - gsl_rng_uniform(a—>mg) * grid_size; 
a—>y - gsl_rng_uniform(a—>mg) * grid_size; 

} while (gridpt(a—>x, a—>y)); 
gridpt(a—>x, a—>y) = a; 

i 

out[pop_size] = (agent_s){}; //empty stopper. 

) 


int run_sim(double * durations, gsl_rng *r, apop_model *m){ 
int grid_size = apop_data_get(m—>parameters, 0); 
int pop_size = apop_data_get(m—>parameters, 1); 
int done_ctr = 0, period = 1; 
pop_size *=2; //guarantee evenness. 
assert(pop_size <= pow(grid_size,2)); 

agent_s alist[pop_size+1 ]; 

agent_s *grid[grid_size * grid_size]; 

memset(grid, 0, grid_size*grid_size*sizeof(agent_s*)); 

generate_agents(grid, grid_size, pop_size, alist); 

do { 

for (agent_s *a=alist; a—>rng; a++){ 
if (a—>done) continue; 
agent_s *b; 

if (a—>type==’A’ && (b=search_for_mate(a, grid, grid_size))){ 
gridpt(a—>x, a—>y) = gridpt(b—>x, b—>y) = NULL; 
a—>done = b—>done = true; 
durations [done_ctr++] = period; 

i 

step(a, grid, grid_size); 

} 

period ++; 

} while (done_ctr < pop_size/2); 

return 0; 

} 


apop_model search_sim = {"A search on a grid", .vsize=2, .draw=run_sim}; 

Listing 22: Agents searching for each other on a grid. 
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#include <apop.h> 


double one_weibull(double d, void *params){ 
double lambda = apop_data_get(params, 0); 
double k = apop_data_get(params, 1); 
return logl(k) — logl(lambda) 

+ (k—l)*(logl(d) — logl(lambda)) 

— powl(d/lambda, k); 


static long double positive_params(apop_data *data, apop_model *v){ 
return apop_linear_constraint(v—>parameters—>vector); 

} 


long double weibull_ll(apop_data *d, apop_model *m){ 

return apop_map_sum(d, .param = m—>parameters, .fn_dp=one_weibull, .part=’a’); 

} 


apop_model *weibull = &(apop_model){"The Weibull", .vsize=2, .log_likelihood = weibull_ll, 
.constraint=positive_params}; 

Listing 23: A Weibull model. 


Listing 23 specifies a Weibull model. As with most models that assume iid 
data, the log likelihood of one element can be written down, and then the total 
log likelihood is the sum of the map of that function onto every element. The 
apop_model struct includes a constraint to keep the MLE search routine from 
setting A or k to zero. 

In listing 24, the one_run function sets the settings for the simulation, makes a 
few thousand draws, and estimates the parameters of a Weibull using those draws, 
thus giving one point estimate of the parameters. 

The fuzzed function puts a prior on the grid size and population settings. The 
output is a PMF of 100 Weibull parameters. Figure 25 is a plot of draws from the 
posterior distribution of (A, k). 
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#include <apop.h> 

extern apop_model search_sim; 

extern apop_model * weibull; 


void one_run(int grid_size, int pop_size){ 

printf("-A run with a %\ X %i grid and %i agentsAn", grid_size, grid_size, pop_size); 

search_sim.dsize = pop_size; 

apop_data_set(search_sim.parameters, 0, .val=grid_size); 
apop_data_set(search_sim.parameters, 1, .val=pop_size); 

apop_model *model_out = apop_estimate(apop_model_draws(&search_sim, 1000), weibull); 
apop_model_show(model_out); 


apop_model *fuzz(apop_model sim){ 
int draws =100; 
gsl_rng *r = apop_rng_alloc(l); 
apop_model *prior = apop_model_cross( 

apop_model_set_parameters(apop_normal, 10, 2), 
apop_model_set_parameters(apop_normal, 10, 2)); 
apop_data *outdata = apop_data_alloc(draws, weibull—>vsize); 
double *params = sim.parameters—>vector—>data; 
for (int i=0; i< draws; i++){ 
do { 

apop_draw(params, r, prior); 

} while (params[l]*2 > pow(params[0], 2)); 
sim.dsize=params [ 1 ]; 

apop_model *est = apop_estimate(apop_model_draws(&sim, 1000), weibull); 
gsl_vector_memcpy(Apop_rv(outdata, i), est—>parameters—>vector); 
apop_model_free(est); 

i 

return apop_estimate(outdata, apop_pmf); 


int main(){ 

apop_prep(NULL, &search_sim); 
one_run(10, 10); 
one_run(100, 10); 
one_run(10, 45); 

apop_model *fuzzed = fuzz(search_sim); 
apop_data_print(fuzzed—>data, .output_name="outdata"); 


Listing 24: Fitting a Weibull distribution to data from the search model 
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Figure 25: The Weibull parameters given simulation inputs with independent Normal 
distributions. 
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